| Chapter | Topic | Max points | Book Ch (K.V.) | Book Ch R for HDS | Assignment done? |
|---|---|---|---|---|---|
| 1 | Start me up! | 20 | 1 & 2 | - | ✗ |
| 2 | Regression and model validation | 5+15 | 3 & 4 | 7 | ✗ |
| 3 | Logistic regression | 5+15 | 5 & 6 | 7 | ✗ |
| 4 | Clustering and classification | 15+5 | - | 12, 17 & 18 | ✗ |
| 5 | Dimensionality reduction techniques | 5 + 15 | 13 & 14 | - | ✗ |
| 6 | Analysis of longitudinal data | 5 + 15 | 8 & 9 | - | ✗ |
| Misc | Misc | | | | |
Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 12 22:36:21 2022"
The text continues here.
Hello, testing this!! ZOM!
Hello World, why are my commits not pushing through?
I was feeling happy and motivated! I got R, Rstudio, Git and Github up and running. Now I see my commits on Github but unfortunately my diary is not updating. Apparently, according to Github, the updating of a public page may take up to 24h. I feel like that is a very long time and I wonder if commits made directly on Github (not remotely with Git) would take as long.
I discovered the IODS-course on an email sent to all doctoral schools. I’m looking forward to learning about rMarkdown, project management in Github, and clustering.
Looking forward to this course!
Update: I got to attend our course IT-help-zoom. It was great! Even though my diary did not update despite our efforts. Apparently everything looks good and my diary should update if I wait some more. The info was reassuring so I’ll be waiting more serenely now!
Meanwhile, I’ll keep myself busy with R for Health Data Science as suggested during the course introduction lecture. It seems very informative and easy to read. I’m somewhat familiar with R but I haven’t used tidyverse much. Piping gets me a bit confused but it’s my favorite topic! Exercise Set 1 is interesting and useful, although it did require some package installations. I’m good to go now and ready to keep working, now that all packages are…packed along!
Update v2: We did it! The diary has updated itself after 2h 13min of me patiently waiting (and eagerly non-stop pushing new commits).
Link to my Github repo: https://github.com/hsanez/IODS-project
Yay! *Image: A happy girl at a laptop, by Marina Vishtak.*
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 22:36:21 2022"
Here we go again…
Load the needed R packages before getting to work: Obs! You might need to install the packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr"))
library(tidyverse)
library(GGally)
I created a new data folder in my IODS-project folder. It includes an R script named ‘create_learning2014.R’, which is the R code for this data wrangling part of Assignment 2. The original data I’m using is the full learning2014 data downloaded from here. The data includes headers and uses tab as a separator. As per the course assignment instructions I have commented the data and my code in ‘create_learning2014.R’.
✓ Done!
In this part we created an analysis dataset which contains 7 variables: gender, age, attitude, deep, stra, surf and points. Some of these (attitude, deep, stra, surf, points) were made by combining questions in the learning2014 data, as defined in our Exercise Set and also on the bottom part of this page. The combination variables were scaled and observations where points = 0 were excluded. The data has now 166 observations and 7 variables. See my full comments and code in ‘create_learning2014.R’.
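A rough sketch of these wrangling steps is shown below. The question grouping for deep is an illustrative placeholder; the actual groupings for deep, stra and surf are defined in the Exercise Set and in ‘create_learning2014.R’.

```r
# Sketch of the wrangling, assuming the full data is already read into 'lrn14'.
# The question vector below is an illustrative placeholder.
library(dplyr)

deep_questions <- c("D03", "D11", "D19", "D27")

# Combination variables are scaled back to the original 1-5 scale
# by taking the mean of the selected questions
lrn14$deep <- rowMeans(lrn14[, deep_questions])
# (stra and surf are created the same way from their own question groups)
lrn14$attitude <- lrn14$Attitude / 10  # sum of 10 questions, scaled

learning2014 <- lrn14 %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  rename(age = Age, points = Points) %>%
  filter(points > 0)  # exclude observations with 0 exam points
```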
✓ Done!
I set the working directory of my R session to the IODS Project folder with setwd() and then saved the analysis dataset to the data folder as learning.csv with write_csv() (readr package, part of tidyverse). ?write_csv was a good help. With read_csv(), str(), dim() and head() I made sure the data was correctly written. Again, see my full comments and code in ‘create_learning2014.R’. (3 points)
✓ Done!
R Markdown Hint: When you knit the document, the working directory is temporarily set to be the folder where your R Markdown file is located. This is good to be aware of when reading data for example.
First we start by reading the analysis data to R. Make sure all required packages are installed (see beginning of Ch2). The data we’re using is the one created in the data wrangling part of assignment 2. The data can be also read directly from this page, where the separator is a comma and headers are included.
#load required packages
library(tidyverse)
library(GGally)
# Read data (Make sure you're in the correct working folder with getwd())
lrn14a <- read_csv("data/learning2014.csv")
# Alternative site for reading the data
#lrn14a <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")
#Structure of the data
str(lrn14a)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Dimensions and first peek at the data
dim(lrn14a);head(lrn14a)
## [1] 166 7
## # A tibble: 6 × 7
## gender age attitude deep stra surf points
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 53 3.7 3.58 3.38 2.58 25
## 2 M 55 3.1 2.92 2.75 3.17 12
## 3 F 49 2.5 3.5 3.62 2.25 24
## 4 M 53 3.5 3.5 3.12 2.25 10
## 5 M 49 3.7 3.67 3.62 2.83 22
## 6 F 38 3.8 4.75 3.62 2.42 21
The analysis data is a subset of the full dataset ‘learning2014’. The full data is the result of a survey on attitudes towards studying and statistics, conducted between 3.12.2014 and 1.10.2015 among students of an introductory course in social statistics.
The results are students’ answers to questions on different subtopics of learning, and some of these subtopics are summarized as new variables in our analysis dataset. The subset we created and read, ‘learning2014.csv’, is an analysis dataset which contains 7 variables: gender, age, attitude, deep, stra, surf and points, where points is the student’s maximum exam points.
Some of these variables were made by combining questions in the learning2014 data
See the survey questions and combinations made for the above mentioned variables on this page.
The summary variables were scaled to a Likert scale (scale from 1 to 5) and observations (students) with 0 exam points were excluded.
The remaining subset of data has 166 observations (students) and 7 variables.
We will look at a graphical overview of our data. We have 7 variables, of which gender seems to be a dichotomous variable without missing values, with more female than male students:
summary(lrn14a$gender); unique(lrn14a$gender); table(lrn14a$gender)
## Length Class Mode
## 166 character character
## [1] "F" "M"
##
## F M
## 110 56
We’ll use this information to split our graphical output in two (male and female students) and draw summary plots of the remaining variables:
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14a, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
In this plot matrix we can see all variables split by gender, boxplots on top, and on the far left histograms, then scatters and density plots, and finally Pearson correlation data with statistical significance marked with asterisks (see GGally documentation).
As we can see, there are visible differences between male and female students in several of the distributions, and attitude shows the strongest correlation with exam points.
For the Pearson correlation results we can see both overall and gender-specific correlation coefficients.
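One way to back this choice with numbers is to print the correlation matrix of the numeric variables directly:

```r
# Pairwise Pearson correlations between the numeric variables;
# the variables most strongly correlated with 'points' are natural
# candidates for the regression model.
library(dplyr)
lrn14a %>%
  select(-gender) %>%   # drop the non-numeric variable
  cor() %>%
  round(2)
```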
We’ll proceed to creating a regression model. As instructed we’ll choose three explanatory variables to our model where exam points is the dependent (outcome) variable. I’ll base my choice on the correlation information above, and I’ll try to avoid possible collinearity between the chosen variables. Since there are multiple explanatory (independent) variables, we’ll use a multiple regression model approach.
Dependent variable: points (exam points)
Independent variables: attitude, stra, surf
# create a regression model with multiple explanatory variables
reg_model <- lm(points ~ attitude + stra + surf, data = lrn14a)
# print out a summary of the model
summary(reg_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the summary of the model we can see that attitude is statistically significantly associated with exam points (p = 1.93e-08), while stra and surf are not (p > 0.05).
The t values are t statistics that have been calculated as estimate/standard error.
The F-statistic has a very low p-value, meaning that there is evidence that at least one of the coefficients of the studied explanatory variables is nonzero.
R-squared tells us about the fit of the model to our data. Here, since we have used multiple explanatory variables, we’ll look at the adjusted R-squared, which corrects for the inflation in R-squared caused by adding more variables to a regression model. Here we see an adjusted R-squared of 0.19, meaning that the three chosen explanatory variables account for approximately 19% of the variation in exam points.
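As a sanity check, the adjusted R-squared can be recomputed by hand from the multiple R-squared, the number of observations n, and the number of predictors p; this gives approximately 0.1927, matching the summary.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.2074  # multiple R-squared from the model summary
n  <- 166     # number of observations
p  <- 3       # number of explanatory variables
1 - (1 - r2) * (n - 1) / (n - p - 1)
```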
Since only attitude seems to be statistically significantly associated with the dependent variable, points, we’ll run the regression model once more, this time only with attitude as an explanatory variable, to fit the model better and get a more parsimonious model. We’ll also draw a scatter plot with a regression line to visualize the result.
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = lrn14a) + geom_smooth(method = "lm")
# create a regression model with a single explanatory variable
reg_model2 <- lm(points ~ attitude, data = lrn14a)
# print out a summary of the model
summary(reg_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The regression line fits the scatter plot quite nicely, with a positive slope and a tight 95% confidence band. Again, we see that attitude is statistically significantly associated with points (p=4.1E-9), and exam points rise by approximately 3.5 when attitude (based on the original survey) rises by 1. The estimate is a little bit higher and the p-value a little bit lower than before. Since we have only one explanatory variable we can assess the multiple R-squared of the model (0.19), which should now equal the square of the correlation coefficient between attitude and points. This means that almost 20% of the variation in points can be explained by attitude.
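This equality is easy to verify directly from the data:

```r
# With a single explanatory variable, the multiple R-squared equals the
# squared Pearson correlation between the explanatory variable and the
# dependent variable.
cor(lrn14a$attitude, lrn14a$points)^2
```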
We have fitted a regression model to our data and interpreted the results. Now we will assess the model to check whether it complies with our statistical assumptions and whether it is a sensible model for our data.
Graphical model validation with plot():
# par() could be used to show plots in a matrix
# par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function.
plot(reg_model2, which=c(1,2,5))
Linear regression modeling has four main assumptions (paraphrased from R for Health Data Science):

- Linear relationship between the explanatory and dependent variables
- Independence of the residuals
- Constant variance (homoscedasticity) of the residuals
- Normal distribution of the residuals
List of diagnostic plots for plot() and which():
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Scale-Location: sqrt(abs(standardized residuals)) vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
✓ ✓ ✓ ✓ ✓ Done!
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 22:36:35 2022"
Here we go yet again…
Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
See create_alc.R on my Github repository.
The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in our data.
Read the analysis data.
# Read data (Make sure you're in the correct working folder with getwd())
alc <- read_csv("data/student_alc.csv")
# Alternative site for reading the data
#alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")
Describe the data.
# print variable names and dimension of the tibble
colnames(alc); dim(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [1] 370 35
The analysis dataset we are using for this course assignment consists of the above listed 35 variables and 370 observations.
Our analysis dataset is a subset of the original ‘Student Performance Data Set’ by Prof. Cortez (University of Minho, Portugal), collected from two Portuguese schools during the school year 2005-2006. The original dataset was collected using questionnaires and school reports.
The analysis dataset was created by joining two datasets of the original data; student data from two school subjects: Mathematics and Portuguese language. These subject-specific datasets were merged by their common variables, and only students present in both subject datasets were included in the resulting analysis dataset. Naturally there was variation in the following time-related variables:
For the numeric variables we calculated student-specific averages using both of the datasets (Mathematics and Portuguese language). For the binary variable (paid) we chose the value from the Mathematics dataset.
See the R code for the exact course of our data wrangling.
You can find the variable descriptions and more information about (and download) the original dataset on this page.
My chosen variables:
The original dataset includes information on students from 15 to 22 years of age. This means not all students are of legal drinking age, which seems to be 18 in Portugal. They can’t buy alcohol themselves, which could show in the relationship between age and alcohol use; there should be less alcohol use among younger students.
On average, men are larger than women and their body composition differs, which makes their physiology, and thus their alcohol metabolism, different from women’s. As a result, male tolerance for alcohol is usually higher, which may lead to men drinking more before feeling the effects of alcohol. For this possible reason I think there might be a relationship between sex and alcohol use.
Students might spend a lot of time at home, both studying and during their free time. Thus, bad family relations could affect study opportunities, motivation and recovery from studying. One reason for bad family relationships may be alcohol itself, perhaps overused by the student or by other family members.
Students drinking more alcohol might have more absences from school. Since weekday alcohol consumption is measured, it could show up as absences during the school week. Friday or Saturday use might not show up as absences in the data, assuming that these Portuguese schools have a regular Monday-to-Friday school week schedule.
My hypotheses:

1. Younger students use less alcohol than older students.
2. Male students use more alcohol than female students.
3. Worse family relations are associated with higher alcohol use.
4. Higher alcohol use is associated with more absences from school.
# Load needed libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(GGally)
# glimpse at the data
glimpse(alc); dim(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## [1] 370 35
# create a vector of the chosen variables + alcohol use
myvars <- c('age', 'sex', 'famrel', 'absences', 'alc_use', 'high_use')
# pivot data and draw a bar plot of the chosen variables
gather(alc[myvars]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# draw summary table grouped by sex and low/high use of alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = round(mean(age),1), mean_family_relation_quality = round(mean(famrel),1), mean_absences = round(mean(absences),1))
## # A tibble: 4 × 6
## # Groups: sex [2]
## sex high_use count mean_age mean_family_relation_quality mean_absences
## <chr> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 154 16.6 3.9 4.3
## 2 F TRUE 41 16.5 3.7 6.9
## 3 M FALSE 105 16.3 4.1 2.9
## 4 M TRUE 70 16.9 3.8 6.1
# draw summary plot matrix stratified for sex as an overview of the data and correlations
p <- ggpairs(alc[myvars], mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Nicely hypothesized - or - as Jorma Uotinen might say:
Our outcome variable is dichotomous, or binary, so we will assess its relationship with our chosen variables with a logistic regression model.
Dependent variable: high_use (high/low alcohol consumption)
Independent variables: age, sex, famrel, absences
library(dplyr); library(ggplot2)
# First, we define our logistic regression model m
# We'll use glm(), a generalized linear model function so we need to specify our model type to be logistic -> family="binomial", which makes the function use the link option 'logit'.
m <- glm(high_use ~ age + sex + famrel + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + famrel + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2151 -0.8371 -0.5950 1.0577 2.1545
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44606 1.75957 -1.958 0.050175 .
## age 0.17148 0.10306 1.664 0.096143 .
## sexM 1.09171 0.24864 4.391 1.13e-05 ***
## famrel -0.32224 0.12913 -2.495 0.012579 *
## absences 0.08915 0.02298 3.879 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 407.03 on 365 degrees of freedom
## AIC: 417.03
##
## Number of Fisher Scoring iterations: 4
# Compute odds ratios (OR) and their confidence intervals (CI) for each variable
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
# print ORs and CIs as a table
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03 0.00 0.98
## age 1.19 0.97 1.46
## sexM 2.98 1.84 4.89
## famrel 0.72 0.56 0.93
## absences 1.09 1.05 1.15
Sex, family relations (famrel) and absences seem to have a statistically significant association with high alcohol use (p < 0.05).
The odds ratio for an explanatory variable with coefficient $\beta$ is $e^{\beta}$.
Age has no statistically significant association with high level of alcohol use.
My hypotheses revisited: in summary, three of my four original hypotheses seem to remain valid after our correlation and regression analyses.
One hypothesis down! Oh no! But no worries…
To assess the predictive power of our model we will create a ‘confusion matrix’, some graphs (bar plot, scatter plot and ROC curve), and compute the training error of the model. Finally we’ll compare the performance of the model with a simple guessing strategy. Later, in part 7 of this report, we’ll test the performance of the model with cross-validation.
As instructed, we’ll use only the variables that showed statistical significance: sex, famrel, absences. Age had no statistically significant association with alcohol use level based on our regression model.
# predict() the probability of high_use
# We're using our model 'm' as the object of the predict()
# type = 'response' specifies that we want the prediction results on the scale of the response variable instead of the default link scale. For a binomial target variable (e.g. our 'high_use') this means predicted probabilities instead of logit-scale values (the default for binomial models).
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
# Let's set our prediction treshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction = probabilities >0.5)
# check the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, famrel, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
## sex famrel absences high_use probability prediction
## <chr> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 M 4 3 FALSE 0.387 FALSE
## 2 M 4 0 FALSE 0.405 FALSE
## 3 M 5 7 TRUE 0.437 FALSE
## 4 F 5 1 FALSE 0.132 FALSE
## 5 F 4 6 FALSE 0.247 FALSE
## 6 F 5 2 FALSE 0.165 FALSE
## 7 F 4 2 FALSE 0.187 FALSE
## 8 F 1 3 FALSE 0.398 FALSE
## 9 M 2 4 TRUE 0.568 TRUE
## 10 M 4 2 TRUE 0.406 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 247 12 259
## TRUE 76 35 111
## Sum 323 47 370
# transform to proportions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins() %>% round(2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67 0.03 0.70
## TRUE 0.21 0.09 0.30
## Sum 0.87 0.13 1.00
Almost one third of students drink an amount classifiable as ‘high use’, and our model underestimates this by predicting that only 13% of students use that much alcohol.
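From the counts in the confusion matrix above we can also derive the standard classification metrics; a small sketch:

```r
# Counts read from the confusion matrix above
TN <- 247; FP <- 12; FN <- 76; TP <- 35

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # proportion classified correctly
sensitivity <- TP / (TP + FN)                   # true positive rate
specificity <- TN / (TN + FP)                   # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)
```

The high specificity (0.95) and low sensitivity (0.32) quantify what the table shows: the model recognizes non-high-use students much better than high-use ones.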
We can visualize our observations and predictions with a bar plot:
# initialize a plot of 'prediction' (predicted outcomes)
#basic colors
#p2 <- ggplot(data = alc, aes(x=prediction, col=prediction, fill=prediction))
# Custom colors
pred_colors <- c("#FFA500","#00FF00")
p2 <- ggplot(data = alc, aes(x=prediction, fill=prediction)) +
scale_fill_manual(values=pred_colors)
# draw a bar plot of high_use by 'high use' (observations)
# delineate outcome variable labels
## define new labels
outlab <- c('True negative','True positive')
## Define old labels in same order
names(outlab) <- c('FALSE','TRUE')
p2 + geom_bar() + facet_wrap("high_use", labeller=labeller(high_use=outlab))
There are more FALSE observations (no high use of alcohol, left panel) than TRUE observations (high use of alcohol, right panel). The model predicts the FALSE observations (true negatives) better than the TRUE observations (true positives).
This can be presented also with a sleeker scatter plot:
# initialize a plot of 'high_use' versus 'probability' in 'alc'
p3 <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
p3 + geom_point()
We can also visualize our confusion matrix (sensitivity and specificity) with a ROC (receiver operating characteristic) curve. This shows the performance of our classification model and helps us assess how well our logistic regression model fits the dataset. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR).
The closer the curve comes to the upper left corner of the plot, the more accurate the model is. The closer the curve gets to the grey 45° line, the less accurate the model is.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Define the curve with function roc(response, predictor), see ?roc() and pROC documentation (Resources)
roc_obj <- roc(alc$high_use, alc$probability)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
# Draw the plot
plot(roc_obj)
We can see that the curve differs from the 45° line, which means that our model seems to have at least some predictive value.
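The curve can also be summarized as a single number, the area under the curve (AUC), where 0.5 corresponds to the diagonal (no predictive value) and 1.0 to a perfect classifier:

```r
# AUC of the ROC curve object defined above with roc()
library(pROC)
auc(roc_obj)
```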
Training error
To calculate the training error (the total proportion of inaccurately classified individuals) of our model, we’ll run the loss function (mean prediction error) presented in our course exercises, loss_func(response, probability). It takes a binary response/classification variable (‘high_use’) and the predicted probability of the response being TRUE for each individual.
abs() gives the absolute value of the difference between the class (0/1) and the probability; if this difference is greater than 0.5, the prediction was wrong.
The function thus assigns TRUE (counted as 1) when the prediction was wrong and FALSE (counted as 0) when it was right. mean() then gives the mean prediction error, i.e. the number of false predictions divided by the total number of observations.
\[ \text{training error} = \frac{n(FP + FN)\cdot 1 + n(TP + TN)\cdot 0}{N} = \frac{n(FP + FN)}{N} \]
The function:
# define the loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
# print(n_wrong) #print this if you want to see the logicals the final result mean() is based on
mean(n_wrong)
}
# call loss_func() to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378
A simple guessing strategy might be guessing that no student ever drinks enough to be classified as ‘high use’ (high_use = FALSE, probability(high_use=TRUE) = 0), that all students drink very much (high_use = TRUE, probability(high_use=TRUE) = 1), or that every other student drinks (probability(high_use=TRUE) = 0.5). These situations can be tested with loss_func() by setting a constant individual probability (prob) as described above.
# No students' drinking habits are 'high use', high_use=FALSE, probability(high_use=TRUE)= 0
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# All students' drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 1
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# Every other student's drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 0.5
loss_func(class = alc$high_use, prob = 0.49)
## [1] 0.3
loss_func(class = alc$high_use, prob = 0.51)
## [1] 0.7
# Every third student's drinking habits are 'high use', high_use=TRUE, probability(high_use=TRUE)= 0.3
loss_func(class = alc$high_use, prob = 0.3)
## [1] 0.3
The training error is approximately 24%, which means that the model’s accuracy (1 - training error) is approximately 76%. By simply guessing how many students’ drinking habits are classified as ‘high use’ we did not reach higher accuracy with any of the tested guesses:
| Guess classified as ‘high use’ | Guessed probability of ‘high use’ drinking | Training error | Accuracy |
|---|---|---|---|
| No one drinks | 0 | 0.3 | 70% |
| Everybody drinks | 1 | 0.7 | 30% |
| Every other drinks* | 0.49 & 0.51 | 0.3 & 0.7 | 70% & 30% |
| Every third drinks | 0.3 | 0.3 | 70% |
OBS! *The case of exactly every other student drinking (prob = 0.50) can’t be assessed, since the function uses a strict 0.5 threshold. Thus close approximations were used instead.
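The 0.5 edge case can be seen directly from the loss function’s comparison: with prob = 0.5 the distance to both classes is exactly 0.5, so the strict `>` never fires. A minimal sketch, re-using the loss function defined above:

```r
# the loss function from above
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# with prob = 0.5, |class - prob| equals the threshold for both classes (0 and 1),
# so the strict comparison is always FALSE and the error is misleadingly 0
loss_func(class = c(0, 1), prob = 0.5)   # 0

# nudging the constant to either side breaks the tie
loss_func(class = c(0, 1), prob = 0.49)  # 0.5
```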
In summary, our model seems to be better at predicting the level of alcohol use of our students than simple guessing.
Quoted from our course material: Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good. Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cvK10 <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2648649
The mean prediction error averaged over the cross-validation rounds (0.26) is slightly higher on the testing data than on the training data (0.24), as expected for unseen data.
The model introduced in our Exercise set 3 had a mean prediction error of about 0.25 in 10-fold cross-validation, which implies that our new model (variables: sex, famrel, absences) has a test set performance similar to the one used in the Exercise set (variables: sex, failures, absences).
We’ll compare 11 different testing models, reducing the number of predictors step by step while always keeping the most significant variables (first dropping five variables per step, later excluding only one variable per model).
Defining the models (logistic regression) and assigning predictions based on prediction probabilities
# check all possible variables
# -> 31 possible independent variables (we'll omit Daily and Weekend alcohol use since they were used for composing our outcome variable!)
colnames(alc)
# models and summaries
# detecting the least statistically significant variables
m1 <- glm(high_use ~ school + sex + age + address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2 + G3 , data = alc, family ="binomial")
summary(m1)
piis <- summary(m1)[12] # element 12 of the glm summary is the coefficients table (same as summary(m1)$coefficients)
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3 + higher + internet + famsup + Medu + schoolsup
m2 <- glm(high_use ~ goout + famrel + absences + reason + sex + guardian + address + activities + freetime + Mjob + health + paid + famsize + romantic + Fjob + studytime + traveltime + nursery + failures + G1 + school + Fedu + G2 + age + Pstatus + G3, data = alc, family ="binomial")
summary(m2)
piis <- summary(m2)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + absences + famrel + reason + sex + guardian + address + activities + freetime + Mjob + paid + health + famsize + studytime + traveltime + nursery + romantic + failures + G1 + school + Fjob + Fedu + G2 + age + Pstatus + G3
m3 <- glm(high_use ~ goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1, data = alc, family ="binomial")
summary(m3)
piis <- summary(m3)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob + traveltime + nursery + school + failures + G1
m4 <- glm(high_use ~ goout + absences + famrel + reason + sex + guardian + address + activities + paid + Mjob + health + freetime + famsize + romantic + studytime + Fjob, data = alc, family ="binomial")
summary(m4)
piis <- summary(m4)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities + paid + famsize + romantic + Fjob + freetime
m5 <- glm(high_use ~ goout + absences + famrel + sex + reason + address + guardian + health + Mjob + studytime + activities, data = alc, family ="binomial")
summary(m5)
piis <- summary(m5)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + absences + famrel + sex + reason + guardian + address + activities + studytime + health + Mjob
m6 <- glm(high_use ~ goout + absences + famrel + sex + reason + guardian, data = alc, family ="binomial")
summary(m6)
piis <- summary(m6)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + sex + absences + famrel + reason + guardian
m7 <- glm(high_use ~ goout + sex + absences + famrel + reason, data = alc, family ="binomial")
summary(m7)
piis <- summary(m7)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + sex + absences + famrel + reason
m8 <- glm(high_use ~ goout + sex + absences + famrel, data = alc, family ="binomial")
summary(m8)
piis <- summary(m8)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + sex + absences + famrel
m9 <- glm(high_use ~ goout + sex + absences, data = alc, family ="binomial")
summary(m9)
piis <- summary(m9)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + sex + absences
m10 <- glm(high_use ~ goout + sex, data = alc, family ="binomial")
summary(m10)
piis <- summary(m10)[12]
piis.sort <- as.data.frame(piis)
names(piis.sort) <- c("Estimate","SE","Tval","Pval")
psort <- arrange(piis.sort,Pval)
# sorted order: goout + sex
m11 <- glm(high_use ~ goout, data = alc, family ="binomial")
summary(m11)
# predict() the probability of high_use
probabilities1 <- predict(m1, type = "response")
probabilities2 <- predict(m2, type = "response")
probabilities3 <- predict(m3, type = "response")
probabilities4 <- predict(m4, type = "response")
probabilities5 <- predict(m5, type = "response")
probabilities6 <- predict(m6, type = "response")
probabilities7 <- predict(m7, type = "response")
probabilities8 <- predict(m8, type = "response")
probabilities9 <- predict(m9, type = "response")
probabilities10 <- predict(m10, type = "response")
probabilities11 <- predict(m11, type = "response")
# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability1 = probabilities1)
alc <- mutate(alc, probability2 = probabilities2)
alc <- mutate(alc, probability3 = probabilities3)
alc <- mutate(alc, probability4 = probabilities4)
alc <- mutate(alc, probability5 = probabilities5)
alc <- mutate(alc, probability6 = probabilities6)
alc <- mutate(alc, probability7 = probabilities7)
alc <- mutate(alc, probability8 = probabilities8)
alc <- mutate(alc, probability9 = probabilities9)
alc <- mutate(alc, probability10 = probabilities10)
alc <- mutate(alc, probability11 = probabilities11)
# use the probabilities to make a prediction of high_use
# Let's set our prediction treshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction1 = probabilities1 >0.5)
alc <- mutate(alc, prediction2 = probabilities2 >0.5)
alc <- mutate(alc, prediction3 = probabilities3 >0.5)
alc <- mutate(alc, prediction4 = probabilities4 >0.5)
alc <- mutate(alc, prediction5 = probabilities5 >0.5)
alc <- mutate(alc, prediction6 = probabilities6 >0.5)
alc <- mutate(alc, prediction7 = probabilities7 >0.5)
alc <- mutate(alc, prediction8 = probabilities8 >0.5)
alc <- mutate(alc, prediction9 = probabilities9 >0.5)
alc <- mutate(alc, prediction10 = probabilities10 >0.5)
alc <- mutate(alc, prediction11 = probabilities11 >0.5)
The order of the variables (by p-value) changed between regressions. However, the ‘top’ variables stayed significant and kept their top positions in the regression summaries throughout. The less significant variables changed rankings often, and thus they were the ones primarily discarded from the model.
Nota bene!
I didn’t get my for-loops or dynamic functions to work, so the part 8 scripts have been and will continue to be ugly…sorry! But don’t worry, since…
Kittycat suffering from ugly-scriptitis
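For the record, a loop-based version is possible. This is only a sketch, not the code used for the results above: `drop_least_significant()` is a hypothetical helper that refits a glm without its least significant term. As written it works for numeric or binary predictors; multi-level factors would first need the coefficient names mapped back to term labels.

```r
# hypothetical helper: refit a glm without its least significant predictor
drop_least_significant <- function(fit) {
  coefs <- summary(fit)$coefficients                 # the coefficients table
  pvals <- coefs[-1, "Pr(>|z|)", drop = FALSE]       # drop the intercept row
  worst <- rownames(pvals)[which.max(pvals)]         # largest p-value = least significant
  update(fit, as.formula(paste(". ~ . -", worst)))   # refit without it
}

# sketch of the elimination loop (assumes m1, alc and loss_func() from above):
# fit <- m1
# errors <- c()
# while (length(coef(fit)) > 2) {
#   fit <- drop_least_significant(fit)
#   errors <- c(errors, loss_func(alc$high_use, predict(fit, type = "response")))
# }
```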
Training and testing errors by model
# Models numbered by amount of variables used
modelss <- c(31,26,21,16,11,6,5,4,3,2,1)
# call loss_func() to compute the average number of wrong predictions in the (training) data
pred_train <- loss_func(class = alc$high_use, prob = alc$probability1)
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability2))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability3))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability4))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability5))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability6))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability7))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability8))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability9))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability10))
pred_train <- append(pred_train, loss_func(class = alc$high_use, prob = alc$probability11))
#10-fold cross validation on testing data
cvK10_ss1 <- cv.glm(data = alc, cost = loss_func, glmfit = m1, K = 10)
cvK10_ss2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cvK10_ss3 <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cvK10_ss4 <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cvK10_ss5 <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cvK10_ss6 <- cv.glm(data = alc, cost = loss_func, glmfit = m6, K = 10)
cvK10_ss7 <- cv.glm(data = alc, cost = loss_func, glmfit = m7, K = 10)
cvK10_ss8 <- cv.glm(data = alc, cost = loss_func, glmfit = m8, K = 10)
cvK10_ss9 <- cv.glm(data = alc, cost = loss_func, glmfit = m9, K = 10)
cvK10_ss10 <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = 10)
cvK10_ss11 <- cv.glm(data = alc, cost = loss_func, glmfit = m11, K = 10)
# average number of wrong predictions in the cross validation
cvK10$delta[1]
## [1] 0.2648649
cvK10_ss <- cvK10_ss1$delta[1]
cvK10_ss <- append(cvK10_ss,cvK10_ss2$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss3$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss4$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss5$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss6$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss7$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss8$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss9$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss10$delta[1])
cvK10_ss <- append(cvK10_ss,cvK10_ss11$delta[1])
# create a tibble
supbon <- tibble(modelss, pred_train, cvK10_ss)
Trends of both training and testing errors by the number of predictors in the model
# Pivot data
longdf <- pivot_longer(supbon, c('pred_train','cvK10_ss'))
# initialize and draw a plot
ggplot(data=longdf, aes(x=modelss, y=value, col=name)) + geom_point() + geom_line() +
# Edit legend title and labels
scale_color_discrete(name = "Mean prediction error", labels = c("testing", "training")) +
# Edit label names and title
labs(x="Model", title='Trends of training and testing errors by the number of predictors in the model')
The mean prediction error shrinks on the training data as variables are added, but on the testing data it is lowest in between \(\rightarrow\) the predictive logistic regression model is at its best with neither too few nor too many variables \(\rightarrow\) 3-5 variables seems like a good option for model construction.
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Done!
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 22:36:47 2022"
Here we go again…
Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)
# load the 'Boston' data
data("Boston")
# explore the dataset
str(Boston);dim(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
The ‘Boston’ dataset from the MASS library is a data frame of Housing Values in Suburbs of Boston. It has 506 observations (rows) and 14 variables (columns).
The variables are:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s

The R documentation of the dataset can be found with ?Boston (MASS package).
# summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# data from wide to long
long_Boston <- pivot_longer(Boston, cols=everything(), names_to = 'variable', values_to = 'value')
# Histograms of all variables in use
p1 <- ggplot(data=long_Boston)
p1 + geom_histogram(mapping= aes(x=value)) +
facet_wrap(~variable, scales="free")
# Summary plot matrix with correlation and density plots
p2 <- ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
All variables of the dataset ‘Boston’ are numeric. Most of the variables are skewed, so we’ll use Spearman’s correlation for correlation assessment later. There are multiple variables with possible outliers. Correlation data is difficult to see from this plot with this many variables, and ggpairs() shows Pearson’s correlations by default, so we’ll compute a Spearman correlation matrix with cor() and visualize it with corrplot() (from the ‘corrplot’ library), focusing on the correlations between variables.
# calculate the correlation matrix
cor_matrix <- cor(Boston, method='spearman')
# print the correlation matrix
cor_matrix %>% round(1)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## crim 1.0 -0.6 0.7 0.0 0.8 -0.3 0.7 -0.7 0.7 0.7 0.5 -0.4 0.6
## zn -0.6 1.0 -0.6 0.0 -0.6 0.4 -0.5 0.6 -0.3 -0.4 -0.4 0.2 -0.5
## indus 0.7 -0.6 1.0 0.1 0.8 -0.4 0.7 -0.8 0.5 0.7 0.4 -0.3 0.6
## chas 0.0 0.0 0.1 1.0 0.1 0.1 0.1 -0.1 0.0 0.0 -0.1 0.0 -0.1
## nox 0.8 -0.6 0.8 0.1 1.0 -0.3 0.8 -0.9 0.6 0.6 0.4 -0.3 0.6
## rm -0.3 0.4 -0.4 0.1 -0.3 1.0 -0.3 0.3 -0.1 -0.3 -0.3 0.1 -0.6
## age 0.7 -0.5 0.7 0.1 0.8 -0.3 1.0 -0.8 0.4 0.5 0.4 -0.2 0.7
## dis -0.7 0.6 -0.8 -0.1 -0.9 0.3 -0.8 1.0 -0.5 -0.6 -0.3 0.2 -0.6
## rad 0.7 -0.3 0.5 0.0 0.6 -0.1 0.4 -0.5 1.0 0.7 0.3 -0.3 0.4
## tax 0.7 -0.4 0.7 0.0 0.6 -0.3 0.5 -0.6 0.7 1.0 0.5 -0.3 0.5
## ptratio 0.5 -0.4 0.4 -0.1 0.4 -0.3 0.4 -0.3 0.3 0.5 1.0 -0.1 0.5
## black -0.4 0.2 -0.3 0.0 -0.3 0.1 -0.2 0.2 -0.3 -0.3 -0.1 1.0 -0.2
## lstat 0.6 -0.5 0.6 -0.1 0.6 -0.6 0.7 -0.6 0.4 0.5 0.5 -0.2 1.0
## medv -0.6 0.4 -0.6 0.1 -0.6 0.6 -0.5 0.4 -0.3 -0.6 -0.6 0.2 -0.9
## medv
## crim -0.6
## zn 0.4
## indus -0.6
## chas 0.1
## nox -0.6
## rm 0.6
## age -0.5
## dis 0.4
## rad -0.3
## tax -0.6
## ptratio -0.6
## black 0.2
## lstat -0.9
## medv 1.0
# visualize the correlation matrix
# cl.pos= position of the color legend, e.g. cl.pos = 'b'
# tl.pos= position of the text labels
# tl.cex = size of text labels
# type = type of plot, 'full', 'upper', 'lower'
# method = the visualization of the correlation coefficients, 'circle', 'square', 'pie', 'number', etc.
# Basic one-style correlation plot: corrplot(cor_matrix, method="ellipse", tl.pos ="d", tl.cex=0.6)
# Cool mixed correlation plot:
corrplot.mixed(cor_matrix, lower = 'number', upper = 'ellipse',tl.pos ="lt", tl.col='black', tl.cex=0.6, number.cex = 0.5)
The darker the color (blue or red, see the color legend) and the narrower the ellipse, the larger the correlation coefficient between the variables. The exact coefficients are shown in the lower triangle of the plot. The slope and the color (blue vs. red) of the ellipses depict the direction (positive vs. negative) of the correlation. Unfortunately, cor() does not provide p-values for the correlations.
There seem to be some strong negative correlations (= more of A correlates with less of B) between:
There are some positive correlations (= more of A correlates with more of B) too:
In summary, some examples of possible correlations:
We’ll now standardize the data by scaling it. Boston data has only numerical variables, so we’ll be able to standardize them all in one go with scale().
Quoted from our course material: In the scaling we subtract the column means from the corresponding columns and divide the difference by the standard deviation.
\[scaled(x) = \frac{x - mean(x)}{ sd(x)}\]
# center and standardize variables with scale()
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame for futher analysis
boston_scaled <- as.data.frame(boston_scaled)
After scaling, the mean of every variable is standardized to zero (0). Standardizing variables that originally had different scales makes their comparison and joint analysis possible. The scale() function returns a matrix, so for further analysis we changed it back to a data frame.
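As a quick sanity check of the formula quoted above, scale() with its default settings matches the manual column-wise computation. A minimal sketch using the first column, crim:

```r
library(MASS)

data("Boston")
scaled <- as.data.frame(scale(Boston))

# manual standardization of one column: subtract the mean, divide by the sd
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

all.equal(scaled$crim, manual)  # TRUE
```

By default scale() centers each column and then divides the centered values by the column’s standard deviation, which is exactly \(scaled(x) = (x - mean(x)) / sd(x)\).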
Let’s replace the continuous crime rate variable with a categorical variable of low, medium low, medium high and high crime rate (scaled).
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
#check new dataset
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
We’ll now divide the dataset into training (80%) and testing (20%) sets for further model fitting and testing.
# number of rows in the scaled Boston dataset
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set (random 80% of the scaled Boston data)
train <- boston_scaled[ind,]
# create test set (random 20% of the scaled Boston data)
test <- boston_scaled[-ind,]
We’ll fit the linear discriminant analysis (LDA), a classification (and dimension reduction) method, on the training set.
# linear discriminant analysis on the training set
# target variable = crime
# all other variables (.) are the predictor variables
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2500000 0.2549505 0.2524752
##
## Group means:
## zn indus chas nox rm age
## low 1.0169598 -0.9229374 -0.15180559 -0.8778800 0.47012031 -0.8863737
## med_low -0.1463457 -0.2577223 -0.07742312 -0.5747185 -0.07093808 -0.4104997
## med_high -0.3690158 0.1730101 0.18636222 0.4187885 0.06104304 0.4369230
## high -0.4872402 1.0149946 -0.11793298 1.0187470 -0.37704560 0.8085724
## dis rad tax ptratio black lstat
## low 0.8868719 -0.6935825 -0.7246173 -0.47341669 0.3867279 -0.79074192
## med_low 0.2866058 -0.5543231 -0.4732602 0.02053927 0.3070947 -0.22882499
## med_high -0.3944342 -0.4589285 -0.3482525 -0.34225826 0.1159547 0.01320748
## high -0.8427242 1.6596029 1.5294129 0.80577843 -0.7053337 0.91034902
## medv
## low 0.52610280
## med_low 0.05478108
## med_high 0.16206130
## high -0.71617412
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11321850 0.755645402 -1.042665174
## indus 0.08999244 -0.192398707 0.340064195
## chas -0.13223567 -0.053073367 -0.006048574
## nox 0.08614333 -0.811847967 -1.290935906
## rm -0.10740586 -0.007105597 -0.085065226
## age 0.29507475 -0.322876441 -0.372540105
## dis -0.11804195 -0.289877209 0.007835232
## rad 4.14620182 0.875773433 -0.194856619
## tax 0.03830201 -0.008206037 0.598167786
## ptratio 0.09425565 0.072637571 -0.167518757
## black -0.10436178 0.062644474 0.107438174
## lstat 0.15892576 -0.211133827 0.316119430
## medv 0.13350937 -0.434985561 -0.085162469
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9610 0.0292 0.0099
There are three discriminant functions (LD1, LD2, LD3). Let’s biplot the results of these functions:
# the function for lda biplot arrows (variables)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results, all dimensions
plot(lda.fit, dimen = 3, col = classes, pch = classes)
# plot LD1 vs. LD2
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
LD1 and LD2 are the strongest functions for separating our observations on the target variable (as we can also see from the proportion of trace in the LDA model output and the graphical overview of all dimensions LD1 through LD3).
The graphical overview of our LDA model shows that LD1 and LD2 clearly separate one group of observations: high per capita crime rate (from our target variable, crime) is separated from the other, lower quartiles. Crime rate seems to be higher when the index of accessibility to radial highways is higher. Other variables don’t draw as much attention.
To predict the crime rate categories with our LDA model, we first save the true crime categories from the test set and then remove them from the test data, so that the predicted classes can later be compared against the truth.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable (true/correct crime rate classes) from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results %>% add sums
tbres1 <- table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
tbres1
## predicted
## correct low med_low med_high high Sum
## low 15 13 1 0 29
## med_low 3 16 6 0 25
## med_high 0 4 16 3 23
## high 0 0 1 24 25
## Sum 18 33 24 27 102
#number of rows in test set
nrow(test)
## [1] 102
Our model seems to predict crime rates quite nicely: the ‘high’ class is predicted almost perfectly, and most misclassifications fall into adjacent classes.
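The overall accuracy can be read off the cross tabulation: correct predictions sit on the diagonal. A minimal sketch re-entering the table above (with your own run the numbers will differ, because the train/test split is random):

```r
# confusion matrix from the cross tabulation above (rows = correct, cols = predicted)
cm <- matrix(c(15, 13,  1,  0,
                3, 16,  6,  0,
                0,  4, 16,  3,
                0,  0,  1, 24),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy = correct predictions (diagonal) / all predictions
sum(diag(cm)) / sum(cm)  # 71/102, about 0.70
```

In a live session the same number comes directly from `mean(correct_classes == lda.pred$class)`.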
We want to investigate the similarity between our observations, so we’ll create clusters of similar data points by running a k-means algorithm on the dataset.
We’ll first standardize the data for further analysis.
# Reload the 'Boston' data
data("Boston")
# Recenter and restandardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame for futher analysis
boston_scaled <- as.data.frame(boston_scaled)
As instructed for our assignment, we’ll calculate the distances between observations to assess the similarity between our data points, using both the Euclidean and Manhattan metrics. It is worth noting that the k-means clustering algorithm uses Euclidean distances.
# Calculate distances between the observations.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
## Run k-means algorithm on the dataset
## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
With k-means clustering with 4 clusters there seems to be some overlap between clusters (based on the plots). Let’s investigate the optimal number of clusters for our model with the total within-cluster sum of squares (WCSS) and how it changes when the number of clusters changes.
# Investigate what is the optimal number of clusters and run the algorithm again
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line', ylab='Total within sum of squares (WCSS)', xlab='number of clusters')
The optimal number of clusters is where the total WCSS drops sharply. Here, the optimal number of clusters for our model seems to be 2. Let’s run the algorithm again with 2 clusters and visualize the results.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# Visualize the clusters
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
pairs(boston_scaled[11:14], col = km$cluster)
We determined that grouping our data with k-means clustering is best done with two centroids, which yields two clusters. However, we do not know what these clusters actually represent.
From the plots we can see that there are very clear separate clusters at least within rad and tax.
To determine possible reasons for the clustering result we could draw a parallel coordinates plot.
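The suggestion above can be sketched directly in R. Assuming the 2-cluster `km` fit and the `boston_scaled` data frame from the chunks above are still in the workspace, we can inspect the cluster centers and draw a parallel coordinates plot with `GGally::ggparcoord()` (one option among several):

```r
# cluster centers on the standardized scale: variables where the two
# centers differ most (e.g. rad, tax) drive the separation
round(km$centers, 2)

# parallel coordinates plot: one line per suburb, colored by cluster
# (work on a copy so boston_scaled itself stays unchanged)
library(GGally)
bs <- boston_scaled
bs$cluster <- factor(km$cluster)
ggparcoord(bs, columns = 1:14, groupColumn = "cluster", alphaLines = 0.3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```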
# Reload the 'Boston' data
data("Boston")
# Recenter and restandardize variables
boston_scaled_bonus <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled_bonus)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled_bonus object
class(boston_scaled_bonus)
## [1] "matrix" "array"
# change the object to data frame for further analysis
boston_scaled_bonus <- as.data.frame(boston_scaled_bonus)
Let’s choose to run the k-means algorithm with an arbitrary number of four clusters.
## Run k-means algorithm with 4 cluster centers on the dataset
## k-means clustering
# K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that (already used in a previous chunk of code in this document)
km_b <- kmeans(boston_scaled_bonus, centers = 4)
# plot the scaled Boston dataset with clusters
#pairs(boston_scaled_bonus, col = km_b$cluster)
#pairs(boston_scaled_bonus[6:10], col = km_b$cluster)
#### Perform LDA using the clusters as target classes
# assess the kmeans() result summary
summary(km_b) #it is not a dataframe
## Length Class Mode
## cluster 506 -none- numeric
## centers 56 -none- numeric
## totss 1 -none- numeric
## withinss 4 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 4 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
typeof(km_b) #it is a 'list' object
## [1] "list"
str(km_b)
## List of 9
## $ cluster : Named int [1:506] 1 1 1 2 2 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:506] "1" "2" "3" "4" ...
## $ centers : num [1:4, 1:14] -0.377 -0.408 0.859 -0.199 -0.333 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
## $ totss : num 7070
## $ withinss : num [1:4] 1102 623 1380 341
## $ tot.withinss: num 3446
## $ betweenss : num 3624
## $ size : int [1:4] 222 98 152 34
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# --> km_b is a 'kmeans' object: a list of 9 components describing the clustering of our 506 observations.
# More detailed info:
#?kmeans()
# subset 'cluster' from km_b (list) and add it as a column to our 'boston_scaled_bonus' dataframe
km_b_clust <- km_b$cluster
km_b_clust <- as.data.frame(km_b_clust)
boston_scaled_bonus <- mutate(boston_scaled_bonus, km_b_clust)
# Run linear discriminant analysis with clusters as target classes and all other variables of the data set as explanatory variables
lda.fit_bonus <- lda(km_b_clust ~ ., data = boston_scaled_bonus)
# print the lda.fit object
lda.fit_bonus
## Call:
## lda(km_b_clust ~ ., data = boston_scaled_bonus)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.43873518 0.19367589 0.30039526 0.06719368
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3773550 -0.3325348 -0.3302399 -0.2723291 -0.3380896 -0.07661724
## 2 -0.4083186 1.5993013 -1.0868786 -0.2321546 -1.0252997 0.85762197
## 3 0.8588075 -0.4872402 1.1204442 -0.2723291 1.0691487 -0.50270159
## 4 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.27566811
## age dis rad tax ptratio black
## 1 -0.06014166 0.07356985 -0.588184980 -0.6041027 -0.06143853 0.2840492
## 2 -1.22890399 1.28526187 -0.598658222 -0.6419128 -0.74348995 0.3532213
## 3 0.79691805 -0.84588600 1.244794751 1.3179961 0.65687472 -0.6809675
## 4 0.37213224 -0.40333817 0.001081444 -0.0975633 -0.39245849 0.1715427
## lstat medv
## 1 -0.2086903 0.03252933
## 2 -0.9364928 0.91875080
## 3 0.9453522 -0.76810974
## 4 -0.1643525 0.57334091
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.034836228 0.01232052 0.13192586
## zn -0.267361320 0.15788045 1.20991564
## indus 0.020533059 -0.58778276 0.21069583
## chas 5.828653032 0.14949639 0.09257929
## nox 0.026703754 -0.34585368 0.54345497
## rm -0.068843435 0.07156541 0.37000239
## age 0.098415418 -0.04458773 -0.53934235
## dis 0.053100700 0.23565203 0.47758361
## rad 0.064174139 -0.72837621 0.11551231
## tax 0.033261463 -0.60257854 0.70761129
## ptratio -0.007773957 -0.10230899 0.07726420
## black 0.015155025 0.03258028 -0.02842152
## lstat -0.133125916 -0.26354444 0.52415255
## medv -0.146715844 0.14530719 0.54132868
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8121 0.1472 0.0407
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes / clusters as numeric
classes_b <- as.numeric(boston_scaled_bonus$km_b_clust)
# plot the lda results
plot(lda.fit_bonus, col = classes_b, pch = classes_b)
plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 1)
plot(lda.fit_bonus, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows(lda.fit_bonus, myscale = 9)
When clustering the Boston data into 4 clusters with the k-means algorithm, the most influential linear separators seem to be the variables with the longest arrows on the LD1 vs LD2 biplot.
The results are somewhat different from our results with the 2-cluster algorithm. When looking at the full matrix of biplots (LD1, LD2, LD3) we could suspect there being either two or five different clusters. From our earlier analyses we do know, though, that two clusters seem to be the best clustering option.
OBS! The results changed after running the R code again. To answer our assignment in general terms, the three most influential variables are the ones with the longest arrows on the LD1 vs LD2 plot. For easier assessment of the arrows, the second LD1 vs LD2 plot has scaled arrows (see the `myscale` argument).
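To make “longest arrows” concrete, the arrow lengths in the LD1 vs LD2 biplot can be ranked directly from the LDA coefficients. A small sketch, assuming the `lda.fit_bonus` object from above:

```r
# length of each variable's arrow in the LD1-LD2 plane
heads <- coef(lda.fit_bonus)
arrow_len <- sqrt(heads[, "LD1"]^2 + heads[, "LD2"]^2)
# the three most influential variables by this measure
head(sort(arrow_len, decreasing = TRUE), 3)
```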
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
We drew a very cool dynamic 3D plot following the instructions of our course assignment! Let’s customize it:
# add colors
# set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes) %>% layout(title = 'Plot A')
To set the colors to be the clusters of the kmeans, we’ll join data with left_join() of dplyr:
# set the color to be the clusters of the kmeans (2 clusters)
# subset 'cluster' from km (list) and add it as a column to our 'boston_scaled' dataframe
km_sb_clust <- km$cluster
km_sb_clust <- as.data.frame(km_sb_clust)
boston_scaled_sb <- mutate(boston_scaled, km_sb_clust)
# merge data
train_sb = train %>% left_join(boston_scaled_sb)
# check data
head(train_sb);dim(train_sb)
## zn indus chas nox rm age dis
## 1 -0.48724019 1.0149946 -0.2723291 -0.1958536 -0.0606794 -0.1376575 -0.1761129
## 2 -0.48724019 1.5674443 -0.2723291 0.5980871 0.2467426 1.0773117 -0.7961887
## 3 0.37030254 -1.0446662 -0.2723291 0.7965722 0.7932707 1.1163897 -0.8473829
## 4 0.04872402 -0.7385595 -0.2723291 -1.2573178 -0.9829455 -1.1288166 1.2836322
## 5 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.5630867 -1.0506607 0.7863653
## 6 -0.48724019 1.0149946 -0.2723291 1.2539511 -1.9991462 1.1163897 -1.0474104
## rad tax ptratio black lstat medv crime
## 1 1.6596029 1.5294129 0.8057784 0.4406159 -0.267896200 0.05079791 high
## 2 -0.6373311 0.1706618 1.2676838 0.4202424 -0.007430722 -0.36237562 med_high
## 3 -0.5224844 -0.8558183 -2.5199404 0.3861769 -0.805631380 0.82278004 med_high
## 4 -0.6373311 -0.3752120 0.2053014 0.4406159 0.061186528 -0.55808940 med_low
## 5 -0.5224844 -0.5769480 -1.5037485 0.3705134 0.428078760 -0.09055093 med_low
## 6 1.6596029 1.5294129 0.8057784 0.1779505 2.516003640 -1.34094452 high
## crim km_sb_clust
## 1 0.2569871 1
## 2 -0.3805669 1
## 3 -0.3437607 2
## 4 -0.4043440 2
## 5 -0.4091990 2
## 6 1.2463082 1
## [1] 404 16
# target classes / clusters as numeric
classes_sb <- as.numeric(train_sb$km_sb_clust)
# Draw the 3D plot
model_predictors_sb <- dplyr::select(train_sb, -crime, -crim, -km_sb_clust)
# check the dimensions
dim(model_predictors_sb)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
# install.packages('plotly')
library('plotly')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = classes_sb) %>% layout(title = 'Plot B')
Plot A shows four different colors depicting our four (quantile-based) crime rate classes: ‘low = 1’, ‘medium low = 2’, ‘medium high = 3’, ‘high = 4’. Plot B shows coloring by our optimal two clusters. Comparing plots A and B, we notice that the dark blue cluster (plot B) seems to include some observations of the ‘medium low’ crime rate class, and the yellow cluster of plot B seems to include all of the low, medium low and medium high crime rate classes. Some medium high crime rate observations seem to have been clustered into the dark blue cluster of plot B. In summary, our model seems to separate the high crime rate suburbs into their own cluster quite nicely.
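The visual comparison of plots A and B can be backed up with a cross-tabulation of the crime rate classes against the two k-means clusters. A sketch assuming the `train_sb` data frame built above:

```r
# rows: crime rate classes, columns: k-means cluster labels;
# a clean separation would put each class mostly in one column
table(crime = train_sb$crime, cluster = train_sb$km_sb_clust)
```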
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 22:37:20 2022"
Here we go again…
Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot", "psych"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
#library(dplyr)
library(ggplot2)
library(corrplot)
#library(stringr)
library(psych)
library(FactoMineR)
# load the 'human' data
human <- read.csv("human_row_names.csv", row.names = 1)
# Alternative data url
# human <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", row.names = 1)
# explore the dataset
str(human);dim(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## [1] 155 8
Graphical overview of the data
# summary of the data
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# Summary plot matrix with correlation and density plots
p1 <- ggpairs(human, mapping = aes(alpha=0.3),
title="Descriptives and Pearson correlation",
lower = list(combo = wrap("facethist", bins = 20)))
p1
# compute the correlation matrix and visualize it with corrplot
#correaltion matrix
cor_matrix_h <- cor(human, method='spearman')
#p-values that correspond our correlation coefficients
cor_pmatrix_h <- corr.test(human, method = "spearman")$p
# visualize
corrplot(cor_matrix_h, type='upper', tl.col='black', col=COL2('BrBG'), addCoef.col='black', number.cex = 0.6, p.mat = cor_pmatrix_h, sig.level = 0.05)
All our variables are skewed. The significance values (p-values) seem to differ a little between the two plots: in our first plot the p-values are based on Pearson’s correlation tests, while our second plot shows all Spearman’s correlation coefficients as numbers and as the color and size of the circles, with correlations at p > 0.05 crossed out.
This difference is due to `ggpairs()` (1st plot) using Pearson’s correlations while we supplied `corrplot()` (2nd plot) with Spearman’s correlations. Since our variables seem skewed, I prefer Spearman’s correlation for bivariate correlation analysis.
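The effect of skewness on the two coefficients is easy to demonstrate on simulated data. A self-contained sketch (not part of the assignment data):

```r
set.seed(1)
x <- rexp(200)                   # heavily right-skewed variable
y <- x^3 + rnorm(200, sd = 0.5)  # monotone but non-linear relation
cor(x, y, method = "pearson")    # pulled around by the few large values
cor(x, y, method = "spearman")   # rank-based, close to 1
```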
There seem to be several strong and statistically significant correlations between the variables.
I have had a hard time understanding PCA, but our course material had a very clear intro to this (quoted directly from our course material): *“Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method. Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.”*
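The SVD connection in the quote can be verified by hand: `prcomp()` centers the data, and the principal component scores equal `U %*% diag(d)` from the SVD of the centered matrix. A sketch on a built-in dataset (component signs are arbitrary, hence the `abs()`):

```r
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d)
scores_prcomp <- prcomp(iris[, 1:4])$x
# should agree up to the sign of each column
all.equal(abs(unname(scores_prcomp)), abs(scores_svd))
```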
We’ll first run a PCA on raw, non-standardized data (part 2) and then on standardized data (part 3). Finally, we’ll interpret the results in part 4 of this assignment.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# create and print out a summary of pca_human (= variability captured by the principal components)
s <- summary(pca_human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 1)*100
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
All the data is clustered in the upper right corner of the plot. PC1 explains essentially 100% of the total variance of the dataset. Because its vector is parallel to PC1, we know that GNI contributes almost exclusively, and almost alone, to PC1. This means that knowing the position of an observation along PC1 would already give an accurate view of how it relates to the other observations in our sample; essentially, the GNI information alone would determine that. Since the data is not scaled, the high influence of GNI on PC1 is most probably due to GNI having the most variance of all the variables in our dataset!
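The suspicion about GNI can be checked directly from the raw variances. A sketch assuming the unscaled `human` data frame loaded above:

```r
# variances of the raw variables: GNI dwarfs everything else,
# which is why it dominates PC1 in the unscaled PCA
sort(apply(human, 2, var), decreasing = TRUE)
```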
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
# create and print out a summary of pca_human (= variability captured by the principal components)
s_hs <- summary(pca_human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# rounded percentages of variance captured by each PC
pca_pr_hs <- round(1*s_hs$importance[2, ], digits = 1)*100
pca_pr_hs
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 50 20 10 10 10 0 0 0
# create object pc_lab to be used as axis labels
pc_lab_hs <- paste0(names(pca_pr_hs), " (", pca_pr_hs, "%)")
# draw a biplot
biplot(pca_human_std, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_hs[1], ylab = pc_lab_hs[2])
The plot with standardized data looks very different from the earlier biplot made with unstandardized data. Also, the variability captured by the principal components seems to be distributed more realistically between the PCs. The results of the unstandardized and standardized PCA differ because PCA requires the data to be scaled for the variables to be comparable.
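As a side note, `prcomp()` can do the standardization itself via its `scale.` argument, which should give the same result as calling `scale()` manually first. A sketch assuming the `human` data from above:

```r
pca_direct <- prcomp(human, center = TRUE, scale. = TRUE)
# same standard deviations as prcomp(scale(human))
summary(pca_direct)$importance[1, ]
```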
For the PCA with standardized data, the variance is spread across the components (PC1 53.6%, PC2 16.2%, and so on). In summary, the data should be scaled before PCA for the analysis to be trustworthy.
I’ll elaborate on the results of the PCA with standardized data:
PC1 explains the most (53.6%) of the variability in the dataset; judging from the rotation matrix and the biplot, the variables that affect it most are Edu2.FM, Edu.Exp, Life.Exp and GNI (loading in one direction) and Mat.Mor and Ado.Birth (loading in the other).
The four variables loading in one direction on PC1 are positively correlated with each other, and the two loading in the opposite direction are positively correlated with each other and negatively with the prior four.
PC2 explains the second most (16.2%) of the variability in the dataset, and the variables that affect it most are Labo.FM and Parli.F.
These two variables are positively correlated with each other but not with any of the variables affecting PC1 (their arrows are perpendicular to the others).
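The perpendicular-arrow reading can be checked against the correlations themselves. A sketch assuming the `human_std` matrix from above; near-zero values would support the interpretation:

```r
round(cor(human_std[, c("Labo.FM", "Parli.F")],
          human_std[, c("Edu.Exp", "Life.Exp", "GNI", "Mat.Mor")]), 2)
```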
According to our assignment, the tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and how they perceive the product (12 questions). In addition, some personal details were asked (4 questions).
We’ll first load the data (wrangled by our course professor K.V.) and then start exploring and analyzing it!
# load the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Structure and dimensions of the data
str(tea);dim(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300 36
# View data
View(tea)
Let’s visualize the data:
# summary of the data
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# Summary plot matrix grouped by sex and age quartile
tea_long <- tea %>%
pivot_longer(cols=-c(sex,age_Q, age), names_to = "variable",values_to = 'value') %>%
group_by(sex,age_Q,variable,value) %>%
summarise(n=n())
tea_long
## # A tibble: 764 × 5
## # Groups: sex, age_Q, variable [330]
## sex age_Q variable value n
## <fct> <fct> <chr> <fct> <int>
## 1 F +60 always always 2
## 2 F +60 always Not.always 21
## 3 F +60 breakfast breakfast 8
## 4 F +60 breakfast Not.breakfast 15
## 5 F +60 dinner dinner 1
## 6 F +60 dinner Not.dinner 22
## 7 F +60 diuretic diuretic 14
## 8 F +60 diuretic Not.diuretic 9
## 9 F +60 effect.on.health effect on health 6
## 10 F +60 effect.on.health No.effect on health 17
## # … with 754 more rows
# Barplots of all variables
# visualize the dataset
pivot_longer(tea, cols = -c(age)) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Unfortunately there are so many variables that the plots don’t come out readable. We’ll have to choose some of the most interesting variables to plot:
# column names to keep in the dataset
keep_columns <- c("age_Q", "sex", "Tea", "How", "price", "SPC", "Sport", "frequency", "relaxing", "healthy", "sophisticated")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## age_Q sex Tea How price
## +60 :38 F:178 black : 74 alone:195 p_branded : 95
## 15-24:92 M:122 Earl Grey:193 lemon: 33 p_cheap : 7
## 25-34:69 green : 33 milk : 63 p_private label: 21
## 35-44:40 other: 9 p_unknown : 12
## 45-59:61 p_upscale : 53
## p_variable :112
##
## SPC Sport frequency relaxing
## employee :59 Not.sportsman:121 +2/day :127 No.relaxing:113
## middle :40 sportsman :179 1 to 2/week: 44 relaxing :187
## non-worker :64 1/day : 95
## other worker:20 3 to 6/week: 34
## senior :35
## student :70
## workman :12
## healthy sophisticated
## healthy :210 Not.sophisticated: 85
## Not.healthy: 90 sophisticated :215
##
##
##
##
##
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Again, quoting a very insightful part of our course material: Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data and it is an extension of Correspondence Analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction.
Since there are so many variables, we’ll focus only on the ones we previously chose (“age_Q”, “sex”, “Tea”, “How”, “price”, “SPC”, “Sport”, “frequency”, “relaxing”, “healthy”, “sophisticated”) and run an MCA.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# summary of the model
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 11 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.205 0.179 0.138 0.129 0.126 0.125 0.114
## % of var. 8.059 7.023 5.404 5.073 4.946 4.910 4.486
## Cumulative % of var. 8.059 15.082 20.485 25.558 30.504 35.414 39.899
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.108 0.104 0.102 0.099 0.096 0.090 0.089
## % of var. 4.251 4.068 3.994 3.889 3.790 3.521 3.508
## Cumulative % of var. 44.150 48.218 52.212 56.101 59.891 63.412 66.920
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.085 0.082 0.076 0.074 0.071 0.068 0.065
## % of var. 3.353 3.209 3.003 2.896 2.809 2.667 2.536
## Cumulative % of var. 70.273 73.482 76.485 79.382 82.190 84.857 87.393
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.061 0.056 0.055 0.053 0.044 0.028 0.024
## % of var. 2.406 2.208 2.160 2.071 1.720 1.091 0.952
## Cumulative % of var. 89.798 92.007 94.166 96.237 97.957 99.048 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | 0.205 0.068 0.009 | 0.966 1.740 0.207 | 0.345 0.289 0.027
## 2 | 0.298 0.145 0.036 | 0.576 0.618 0.135 | 0.099 0.024 0.004
## 3 | -0.118 0.023 0.006 | 0.135 0.034 0.007 | 0.034 0.003 0.000
## 4 | -0.497 0.401 0.183 | -0.191 0.068 0.027 | -0.017 0.001 0.000
## 5 | -0.237 0.091 0.031 | 0.274 0.140 0.042 | 0.078 0.015 0.003
## 6 | -0.800 1.040 0.253 | 0.002 0.000 0.000 | 0.364 0.322 0.053
## 7 | -0.020 0.001 0.000 | 0.542 0.549 0.107 | 0.441 0.472 0.070
## 8 | 0.197 0.063 0.014 | 0.231 0.099 0.019 | 0.337 0.275 0.041
## 9 | 0.195 0.062 0.014 | 0.569 0.603 0.119 | 0.617 0.922 0.140
## 10 | 0.496 0.400 0.090 | 0.323 0.195 0.038 | 0.553 0.742 0.112
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## +60 | 1.564 13.735 0.355 10.301 | -1.327 11.337 0.255 -8.737 |
## 15-24 | -1.100 16.446 0.535 -12.650 | -0.578 5.203 0.148 -6.643 |
## 25-34 | -0.125 0.159 0.005 -1.182 | 0.668 5.214 0.133 6.310 |
## 35-44 | 0.543 1.741 0.045 3.682 | 0.521 1.842 0.042 3.535 |
## 45-59 | 0.470 1.992 0.056 4.107 | 0.601 3.730 0.092 5.247 |
## F | -0.088 0.203 0.011 -1.836 | -0.402 4.872 0.236 -8.393 |
## M | 0.128 0.297 0.011 1.836 | 0.586 7.108 0.236 8.393 |
## black | 0.708 5.484 0.164 7.008 | -0.092 0.107 0.003 -0.913 |
## Earl Grey | -0.375 4.017 0.254 -8.716 | 0.000 0.000 0.000 -0.010 |
## green | 0.607 1.795 0.046 3.689 | 0.209 0.245 0.005 1.272 |
## Dim.3 ctr cos2 v.test
## +60 0.600 3.011 0.052 3.950 |
## 15-24 0.209 0.882 0.019 2.399 |
## 25-34 -0.358 1.945 0.038 -3.380 |
## 35-44 -0.020 0.003 0.000 -0.134 |
## 45-59 -0.271 0.984 0.019 -2.364 |
## F -0.247 2.393 0.089 -5.159 |
## M 0.360 3.491 0.089 5.159 |
## black 0.673 7.394 0.149 6.664 |
## Earl Grey -0.130 0.719 0.030 -3.020 |
## green -0.750 4.086 0.069 -4.557 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## age_Q | 0.769 0.537 0.103 |
## sex | 0.011 0.236 0.089 |
## Tea | 0.255 0.007 0.185 |
## How | 0.137 0.083 0.061 |
## price | 0.106 0.115 0.351 |
## SPC | 0.697 0.618 0.328 |
## Sport | 0.140 0.061 0.104 |
## frequency | 0.019 0.135 0.104 |
## relaxing | 0.069 0.120 0.034 |
## healthy | 0.033 0.030 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali") #individuals (ind) invisible = plots variables
plot(mca, invisible=c("var"), graph.type = "classic") #variables (var) invisible = plots individuals
Our first and fourth plots (MCA factor maps) show variable categories with similar profiles grouped together:
‘non-worker’, ‘+60’ year olds and drinking tea in an ‘other’ way (i.e. not alone, not with lemon nor milk) have similar profiles. ‘student’ and ‘15-24’ year olds also seem to have similar profiles.
The ‘non-worker’, ‘+60’ and ‘other’ categories seem to be negatively correlated with the ‘student’ and ‘15-24’ categories (they lie on opposite sides of the plot origin), and especially with the ‘25-34’ year olds and ‘workman’ categories.
‘non-worker’, ‘+60’ year olds, ‘other’ kind of drinkers, ‘student’, ‘15-24’ year olds, ‘senior’ and ‘p_unknown’ (tea price unknown) seem to be the most distinctive variable categories in our data (they are the farthest from the origin, i.e. they deviate most from the average profile).
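Beyond eyeballing the factor map, the categories driving each dimension can be ranked numerically from the contribution table stored in the MCA result (`$var$contrib`, listed in the object summary above). A small self-contained sketch using the `tea` example data shipped with FactoMineR (the column subset here is illustrative, not our exact selection):

```r
library(FactoMineR)

data(tea)  # example tea survey data included in FactoMineR
mca_demo <- MCA(tea[, c("Tea", "How", "sugar", "where")], graph = FALSE)

# contribution (%) of each category to dimension 1, largest first
contrib <- mca_demo$var$contrib
head(contrib[order(-contrib[, 1]), 1, drop = FALSE], 5)
```

The same sorting applied to our `mca` object would list the categories behind the patterns described above.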
From the third plot (correlation between variables and the principal dimensions) and the eta2 table we can see that ‘age_Q’ and ‘SPC’ are the variables most strongly associated with the first two dimensions, while ‘price’ is most strongly linked to the third.
To assess whether the variable categories differ significantly from each other we’ll draw confidence ellipses, which is possible with ‘FactoMineR’:
plotellipses(mca)
The plots are hard to read, so we’ll focus on specific variables, four at a time.
plotellipses(mca, keepvar = c("sex", "Tea", "age_Q", "How"))
plotellipses(mca, keepvar = c("Sport", "price", "frequency", "SPC"))
plotellipses(mca, keepvar = c("relaxing", "healthy", "sophisticated"))
This shows us which category means differ clearly: categories whose confidence ellipses do not overlap can be considered clearly separated from each other.
✓ ✓ ✓ ✓ ✓ Ch5 done!
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 22:37:41 2022"
Here we go again!
Sweet, sweet longitudinal data…
For longitudinal data we cannot assume observations to be independent of each other, since longitudinal data means multiple observations of the same individuals. For this assignment we’ll first do some data wrangling to transform our data from wide format to long format. Then we’ll analyse our data with linear mixed effects models: a random intercept model, and a random intercept and slope model.
Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot", "psych"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
#library(GGally)
#library(dplyr)
library(ggplot2)
#library(corrplot)
#library(stringr)
#library(psych)
#library(FactoMineR)
library(lme4)
See meet_and_repeat.R on my repository for the data wrangling part.
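As a minimal sketch of what that wide-to-long wrangling does, here is a toy example shaped like the rats file (two rats and two weighing-day columns; the real file has 16 rats and 11 `WD*` columns):

```r
library(tidyverse)

# toy wide-format data: one row per rat, one WD* column per weighing day
rats_wide <- tibble(ID = 1:2, Group = c(1, 2),
                    WD1 = c(240, 410), WD8 = c(250, 415))

# gather the WD* columns into long format and extract the day number
rats_long <- rats_wide %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substring(WD, 3)))  # "WD8" -> 8

rats_long  # one row per rat per weighing day
```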
For the first part of the analysis, graphical displays and summary measure approach to longitudinal data we’ll use the wrangled ‘rats’ dataset now found in my repository.
# read long format data of rats
ratsL <- read.csv('rats_long.csv', row.names = 1)
# glimpse and dimensions
head(ratsL);dim(ratsL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
## [1] 176 5
# structure and summary of the data
str(ratsL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(ratsL)
## ID Group WD Weight
## Min. : 1.00 Min. :1.00 Length:176 Min. :225.0
## 1st Qu.: 4.75 1st Qu.:1.00 Class :character 1st Qu.:267.0
## Median : 8.50 Median :1.50 Mode :character Median :344.5
## Mean : 8.50 Mean :1.75 Mean :384.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:511.2
## Max. :16.00 Max. :3.00 Max. :628.0
## Time
## Min. : 1.00
## 1st Qu.:15.00
## Median :36.00
## Mean :33.55
## 3rd Qu.:50.00
## Max. :64.00
# Convert categorical data to factors
## ID and Group
ratsL$ID <- factor(ratsL$ID)
ratsL$Group <- factor(ratsL$Group)
We have a dataset with 16 rats and 11 weight (grams) observations per rat, taken on different days. The rats are divided into 3 groups, i.e. three different diets. Our data is in longitudinal format, which means we can study possible weight differences between the rats in these 3 groups and the possible change of weight over time.
Let’s plot the Weights of each rat by time and groups.
# Draw the plot
ggplot(ratsL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(ratsL$Weight), max(ratsL$Weight)))
The weight of each rat increases through the days. Group 1 has the most rats, while groups 2 and 3 each include 4 rats. Both the baseline and final weights of the rats increase by group, with group 1 having the rats with the lowest weights and group 3 the highest. One rat in group 2 has the highest baseline and final weight.
Standardizing to assess the tracking phenomenon
Higher baseline weight may mean higher final weight. This is called the tracking phenomenon. The plots drawn earlier suggest this might be the case in our data. We can assess this more clearly by plotting standardized data.
We’ll standardize our data with scale():
\[standardised(x) = \frac{x - mean(x)}{ sd(x)}\]
# Standardise the variable Weight at each time point
ratsL <- ratsL %>%
group_by(Time) %>%
mutate(Weight_std= (scale(Weight))) %>%
ungroup()
# Glimpse the data
head(ratsL)
## # A tibble: 6 × 6
## ID Group WD Weight Time Weight_std[,1]
## <fct> <fct> <chr> <int> <int> <dbl>
## 1 1 1 WD1 240 1 -1.00
## 2 2 1 WD1 225 1 -1.12
## 3 3 1 WD1 245 1 -0.961
## 4 4 1 WD1 260 1 -0.842
## 5 5 1 WD1 255 1 -0.882
## 6 6 1 WD1 260 1 -0.842
# Plot again with the standardised weight
ggplot(ratsL, aes(x = Time, y = Weight_std, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized Weight")
After standardizing our data the difference between the baseline and final weights looks smaller in the plot.
Summary graphs
As our course material indicates, we can produce graphs showing average (mean) profiles for each group along with some indication of the variation of the observations at each time point (the standard error of mean):
\[se = \frac{sd(x)}{\sqrt{n}}\]
# Summary data with mean and standard error of Weight by Group and Time
ratsLs <- ratsL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(length(Weight))) ) %>%
ungroup()
# Glimpse the data
glimpse(ratsLs)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
# Plot the mean profiles
ggplot(ratsLs, aes(x = Time, y = mean, color=Group, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
# legend inside plot
#theme(legend.position = c(0.8,0.4)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
There is no significant overlap between the groups, although there is clearly less difference between groups 2 and 3 than between either of them and group 1. There is also the least variation between the weights of rats in group 3 and the most in group 2. The weight of the rats increases in each group during the study.
Summary measure approach
Let’s look at the weight measurements after the start of the study (i.e. we’ll exclude the baseline day 1 weights). We’ll calculate the mean weight for each rat, and then draw boxplots of the mean weight for each diet group.
# Create a summary data by group and subject with mean as the summary variable (ignoring baseline week 1)
ratsL_sum <- ratsL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(Weight_mean=mean(Weight)) %>%
ungroup()
# Glimpse the data
glimpse(ratsL_sum)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Weight_mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 44…
# Draw a boxplot of the mean versus diet
ggplot(ratsL_sum, aes(x = Group, y = Weight_mean, col=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64")
The mean summary measure is more variable in the second diet group,
and its distribution in this group is somewhat skewed. All three
groups seem to have an outlier: a mean weight more than 1.5 times
the interquartile range above the upper quartile or below the lower
quartile. These outliers might bias the conclusions from the later
comparisons of the groups, so we’ll remove them from our analysis and
redraw the boxplots. The outlier in group 3 lies within the overall
filtering limits (the min and max of the means across all groups), so
we’ll have to look up that rat’s mean separately to exclude it from the group.
# define outlier from group 3
g3 <- ratsL_sum %>% filter(Group==3)
out3 <- min(g3$Weight_mean)
# Create a new data by filtering the outliers
ratsL_sum2 <- ratsL_sum %>%
filter(250 < Weight_mean & Weight_mean < 560 & Weight_mean != out3)
# Draw a boxplot of the mean versus diet
ggplot(ratsL_sum2, aes(x = Group, y = Weight_mean, col=Group)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64")
There is less within group variation after excluding the outliers and we
may proceed with our analysis.
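The hard-coded limits used above (250 and 560, plus the separately looked-up group 3 minimum) could also be replaced by applying the 1.5 × IQR rule programmatically within each group. A sketch (the helper function name is my own):

```r
library(dplyr)

# drop values outside 1.5 * IQR of their own group (Tukey's rule)
drop_outliers <- function(df, value, group) {
  df %>%
    group_by({{ group }}) %>%
    filter({{ value }} >= quantile({{ value }}, 0.25) - 1.5 * IQR({{ value }}),
           {{ value }} <= quantile({{ value }}, 0.75) + 1.5 * IQR({{ value }})) %>%
    ungroup()
}

# usage on the summary data above:
# ratsL_sum2 <- drop_outliers(ratsL_sum, Weight_mean, Group)
```

This keeps the filtering criterion explicit and reusable instead of tied to eyeballed constants.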
T-test or analysis of variance (ANOVA) or analysis of covariance (ANCOVA)
Based on our previous plots, there seems to be a difference in weights in our diet groups. We’ll now test if there is a statistical difference and whether it is statistically significant.
A t-test tests whether there is a statistical difference between two groups, while ANOVA tests whether there is a difference between three or more groups. Thus, we’ll use ANOVA for our analysis.
ANOVA assumes homogeneity of variance - the variance in groups 1-3 should be similar. We’ll test this with Levene’s test before proceeding further with ANOVA.
# Levene's test from 'car' package
library(car)
leveneTest(Weight_mean ~ Group, ratsL_sum2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.3727 0.698
## 10
Levene’s test returned a nonsignificant p-value for the difference of variances between the groups, so we fail to reject the null hypothesis: there is no evidence of a difference between the variances of groups 1-3. We may proceed with ANOVA.
# Fit the linear model with the mean Weight as the response
fit <- lm(Weight_mean ~ Group, data = ratsL_sum2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: Weight_mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 176917 88458 2836.4 1.687e-14 ***
## Residuals 10 312 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This shows there seems to be a significant difference in mean weights between diet groups.
However, we should take into account baseline weight to assess the differences in mean weights of groups 1-3 while accounting for the tracking phenomenon. We’ll add the baseline weight from the original dataset rats and run ANCOVA.
# Load original wide form rats data
rats_og <- as_tibble(read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t'))
rats_og$ID <- factor(rats_og$ID)
#rats_og$Group <- factor(rats_og$Group)
# Add the baseline from the original data as a new variable to the summary data
join_vars <- c("ID","WD1")
ratsL_sum3 <- ratsL_sum2 %>%
left_join(rats_og[join_vars], by='ID')
# Rename column
ratsL_sum3 <- ratsL_sum3 %>%
rename('Weight_baseline' = 'WD1')
# Fit the linear model with the mean Weight as the response
fit2 <- lm(Weight_mean ~ Weight_baseline + Group, data = ratsL_sum3)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit2)
## Analysis of Variance Table
##
## Response: Weight_mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Weight_baseline 1 174926 174926 6630.941 3.216e-14 ***
## Group 2 2065 1033 39.147 3.628e-05 ***
## Residuals 9 237 26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Baseline weight is strongly correlated with mean weight later in the study. However, diet (Group) seems to keep its statistically significant association with weight even after taking baseline weight into account. The result is in line with our graphical overview of the dataset after the exclusion of outliers.
In part II of this assignment we’ll use the dataset BPRS, which includes psychiatric patient data: a brief psychiatric rating scale (BPRS) score measured prior to treatment and then weekly over 8 weeks of treatment. The patients (n=40) were randomly assigned to treatment arm 1 or 2, and we are interested in whether the BPRS scores differ depending on the received treatment. A lower score means fewer symptoms.
We’ll first explore the data and then analyse it to determine whether the BPRS scores of the two treatment groups differ.
# read in the data
BPRSL <- read.csv('BPRS_long.csv', row.names=1)
head(BPRSL);dim(BPRSL)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
## [1] 360 5
View(BPRSL)
# Factor variables subject and treatment group
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
From glimpsing and viewing the data we notice that in both treatment arms the subjects are coded 1-20. However, individuals with the same subject code are different individuals, since each patient was randomized to either treatment 1 or treatment 2. To avoid mix-ups in our analysis we’ll create a new individual-specific factor ID variable.
# New ID-variable
BPRSL <- BPRSL %>%
mutate(ID = as.factor(paste0(subject, "_t",treatment)))
# check data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ID <fct> 1_t1, 2_t1, 3_t1, 4_t1, 5_t1, 6_t1, 7_t1, 8_t1, 9_t1, 10_t1,…
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
## ID
## 1_t1 : 9
## 1_t2 : 9
## 10_t1 : 9
## 10_t2 : 9
## 11_t1 : 9
## 11_t2 : 9
## (Other):306
Let’s plot!
# Plot the data
ggplot(BPRSL, aes(x = week, y = bprs, group = ID, col=treatment)) +
geom_line(aes(linetype=treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0,8,2)) +
scale_y_continuous(name="BPRS") +
theme(legend.position = "right")
BPRS seems to decrease during the treatment period in both treatment arms. There is no clear difference to be seen between the groups. We’ll divide the plot into two separate plots by treatment.
Still, we cannot draw trustworthy conclusions of these two treatment
arms affecting BPRS. Let’s proceed.
Linear Mixed Effects Models
Now things are getting serious. LET’S GO!
Linear model
When fitting a linear model to our data we assume independence of the observations (here, the BPRS measurements). Since the data is longitudinal, we know they are not independent. We’ll ignore this knowledge for now and proceed with multiple regression modeling!
# create a regression model BPRSL_reg
BPRSL_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# print out a summary of the model
summary(BPRSL_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
BPRS score is our target variable, while time (week) and treatment arm are our explanatory variables. Week has a statistically significant association with BPRS score, whereas the treatment arms do not seem to differ from each other in BPRS scores. However, this analysis assumes independence of observations, which is highly unlikely to hold here given that we are studying repeated measures of BPRS scores (longitudinal data).
We’ll now proceed to analyses where the assumption of independence between observations is not required. We’ll start by running a random intercept model to assess whether there is a difference between the two treatment arms. Fitting a random intercept model allows the linear regression fit for each individual to differ in intercept from the other individuals. The model takes into account both the fixed effects and the random effects of our explanatory variables on our target variable.
The Random Intercept Model
# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | ID), data = BPRSL, REML = FALSE)
# Print the summary of the model
BPRSL_ref
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | ID)
## Data: BPRSL
## AIC BIC logLik deviance df.resid
## 2582.903 2602.334 -1286.452 2572.903 355
## Random effects:
## Groups Name Std.Dev.
## ID (Intercept) 9.869
## Residual 7.364
## Number of obs: 360, groups: ID, 40
## Fixed Effects:
## (Intercept) week treatment2
## 46.4539 -2.2704 0.5722
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | ID)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: ID, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
confint(BPRSL_ref)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 7.918218 12.645375
## .sigma 6.828332 7.973573
## (Intercept) 41.744598 51.163179
## week -2.565919 -1.974914
## treatment2 -5.885196 7.029641
In our dataset the minimum BPRS score is 18 and the maximum is 95. The variance at the ID level is high (97.4, i.e. a standard deviation of 9.9). This confirms that there is high variance between the individual patients’ BPRS scores, which tells us that their baseline symptom levels differ considerably.
One unit change in time (one week passing) is associated with a decrease of 2.3 BPRS points. Individuals on treatment 2 seem to have 0.57 point higher BPRS scores, but the confidence interval implies the treatment arms do not differ significantly (the 95% CI spans both sides of 0). These results are similar to the ones from our linear model; only the standard error (SE) for time (week) is lower with this model, while the SE for treatment group is higher than in the previous linear model. This too suggests that time is statistically significantly associated with BPRS score while treatment arm is not.
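The variance components printed above can also be summarized as an intraclass correlation (ICC), the share of total variance due to between-individual differences. A quick sketch using the numbers reported in the model summary:

```r
# variance components from the random intercept model summary above
var_id       <- 97.39  # between-individual (ID) intercept variance
var_residual <- 54.23  # within-individual residual variance

# ICC: proportion of total variance attributable to individuals
icc <- var_id / (var_id + var_residual)
round(icc, 2)  # roughly 0.64: most of the variation is between patients
```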
Let’s fit the random intercept and random slope model to our data:
Random Intercept and Random Slope Model
A nice and clear quote from our material: fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This lets us account for individual differences in the individuals’ symptom (BPRS score) profiles, as well as the effect of time.
# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment + (week | ID), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | ID)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: ID, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
confint(BPRSL_ref1)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 10.3102841 16.7099130
## .sig02 -0.8241816 -0.4128486
## .sig03 1.1562013 2.0277746
## .sigma 5.5925524 6.6009205
## (Intercept) 40.5849755 51.2468100
## week -2.8150999 -1.7257334
## treatment2 -4.9313699 7.9592153
# perform an ANOVA test on the two models to assess formal differences between them
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | ID)
## BPRSL_ref1: bprs ~ week + treatment + (week | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRSL_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the random slope to the model increased the estimated between-individual variance. The treatment groups seem to remain nonsignificant when it comes to BPRS over time. The ANOVA test between our models shows that the model with the random slope is a significantly better fit to these data.
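The anova() comparison above is a likelihood ratio test, and its chi-squared statistic and p-value can be reproduced by hand from the printed deviances and parameter counts:

```r
# deviances and parameter counts from the anova() table above
dev_ref  <- 2572.9  # random intercept model (5 parameters)
dev_ref1 <- 2509.2  # random intercept + slope model (7 parameters)

chisq <- dev_ref - dev_ref1              # 63.7, matches the Chisq column
p_val <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
p_val                                    # about 1.5e-14, as reported
```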
Let’s take the interaction of our explanatory variables into account:
Random Intercept and Random Slope Model with interaction
We’ll now fit a random intercept and slope model that allows for a treatment × time interaction, and plot the result.
# create a random intercept and random slope model with the interaction
# week * treatment = all effects and interactions of time and treatment = week + treatment + week:treatment
# random effect that allows individuals (ID) to vary randomly in terms of their baseline BPRS-value and their effects on BPRS over time (="their individual effect of Time on BPRS" = random slope)
BPRSL_ref2 <- lmer(bprs ~ week + treatment + week * treatment + (week | ID), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | ID)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: ID, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
confint(BPRSL_ref2)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 10.2191216 16.4882560
## .sig02 -0.8181446 -0.4021022
## .sig03 1.1186279 1.9764684
## .sigma 5.5925518 6.6009204
## (Intercept) 41.8936922 53.8774190
## week -3.3816817 -1.8749850
## treatment2 -10.7648856 6.1826634
## week:treatment2 -0.3495621 1.7812288
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | ID)
## BPRSL_ref2: bprs ~ week + treatment + week * treatment + (week | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1 7 2523.2 2550.4 -1254.6 2509.2
## BPRSL_ref2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
The results of our latest model (random intercept and slope, plus an interaction term) seem similar to our second model (random intercept and slope, no interaction term). The ANOVA between the two models confirms there is no significant difference between them in our data. In summary, adding the interaction term did not improve our model this time.
Next, we’ll plot the observed BPRS values and the fitted BPRS values.
# draw the plot of BPRSL with the observed BPRS values
ggplot(BPRSL, aes(x = week, y = bprs, group = ID, col=treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Observed BPRS") +
theme(legend.position = "right") +
facet_grid(. ~ treatment, labeller=label_both)
# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref)
Fitted1 <- fitted(BPRSL_ref1)
Fitted2 <- fitted(BPRSL_ref2)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(bprs_fitted_values_BPRSL_ref = Fitted, bprs_fitted_values_BPRSL_ref1 = Fitted1, bprs_fitted_values_BPRSL_ref2 = Fitted2)
head(BPRSL)
## treatment subject weeks bprs week ID bprs_fitted_values_BPRSL_ref
## 1 1 1 week0 42 0 1_t1 50.39349
## 2 1 2 week0 58 0 2_t1 53.42798
## 3 1 3 week0 54 0 3_t1 46.52190
## 4 1 4 week0 55 0 4_t1 61.06652
## 5 1 5 week0 72 0 5_t1 60.96188
## 6 1 6 week0 48 0 6_t1 44.74306
## bprs_fitted_values_BPRSL_ref1 bprs_fitted_values_BPRSL_ref2
## 1 39.66964 40.11147
## 2 63.89902 64.08993
## 3 52.23190 52.47526
## 4 62.50789 62.81131
## 5 74.47847 74.63778
## 6 46.17363 46.46693
# draw the plot of BPRSL with the Fitted values of bprs model 1
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref, group = ID, col=treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted BPRS (model 1: rnd intercept)") +
theme(legend.position = "right") +
facet_grid(. ~ treatment, labeller=label_both)
# draw the plot of BPRSL with the Fitted values of bprs model 2
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref1, group = ID, col=treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted BPRS (model 2: rnd intercept + slope)") +
theme(legend.position = "right") +
facet_grid(. ~ treatment, labeller=label_both)
# draw the plot of BPRSL with the Fitted values of bprs model 3
ggplot(BPRSL, aes(x = week, y = bprs_fitted_values_BPRSL_ref2, group = ID, col=treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted BPRS (model 3: rnd intercept + slope + interaction)") +
theme(legend.position = "right") +
facet_grid(. ~ treatment, labeller=label_both)
We can see how model 1 (random intercept) differs from model 2 (random intercept & slope) and model 3 (random intercept & slope + interaction). Model 1 does not take into account the individual effect on BPRS over time, as models 2 and 3 do. Models 2 and 3 yield similar results, as we determined in our previous analysis.
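As an informal numeric check of the visual comparison, the root mean squared error (RMSE) between observed and fitted BPRS values could be computed for each model. A sketch (the helper is my own; the commented lines assume the fitted-value columns created above):

```r
# root mean squared error between observed and fitted values
rmse <- function(observed, fitted) sqrt(mean((observed - fitted)^2))

# in-sample fit of the three models (smaller = closer fit; note this
# ignores model complexity, unlike the AIC/BIC used earlier):
# rmse(BPRSL$bprs, BPRSL$bprs_fitted_values_BPRSL_ref)
# rmse(BPRSL$bprs, BPRSL$bprs_fitted_values_BPRSL_ref1)
# rmse(BPRSL$bprs, BPRSL$bprs_fitted_values_BPRSL_ref2)
```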
WOAH! DONE! ✓ Toughest modeling so far, but we made it! I loved the course, would redo it anytime (well… maybe at the earliest after at least a two-week break). I learned so much, and more than I thought I could! HUZZAH!
Merry Christmas!
I have been Rlightened. ┌( ಠ_ಠ)┘
(more chapters to be added similarly as we proceed with the course!)